Diamond Price Prediction with Linear Regression

Table of contents

  • Installing Required Libraries
  • Importing Libraries
  • Importing Dataset
  • EDA
    • Data Profiling
    • Drop Unnecessary Column: Unnamed: 0
    • Check for Null Values
    • Data Analysis
  • Preprocessing the Data
    • Check for Duplicate Values
    • Encoding Categorical Data
  • Splitting the Dataset into Training Set and Test Set
    • Train Test Split
  • Feature Scaling
  • Machine Learning Models
    • Linear Regression
    • Decision Tree Regression
    • Random Forest Regression
    • XGBRegressor
    • Gradient-Boosting-Regressor Model
    • Ada-Boost-Regressor Model
    • LGBM Regressor Model
    • Cat-Boost-Regressor Model
  • All Models Comparison
  • Conclusion

Installing Required Libraries¶

In [1]:
#%pip install numpy
#%pip install pandas
#%pip install matplotlib
#%pip install ipympl
#%pip install seaborn
#%pip install scikit-learn

%pip install xgboost
%pip install catboost
%pip install lightgbm
Requirement already satisfied: xgboost in c:\users\mohan\anaconda3\lib\site-packages (2.0.1)
Requirement already satisfied: catboost in c:\users\mohan\anaconda3\lib\site-packages (1.2.2)
Requirement already satisfied: lightgbm in c:\users\mohan\anaconda3\lib\site-packages (4.1.0)
Note: you may need to restart the kernel to use updated packages.

Importing Libraries¶

In [4]:
import numpy as np # linear algebra
import pandas as pd # data processing

#Data visualization libraries
import matplotlib.pyplot as plt # data visualization with matplotlib
import seaborn as sns # data visualization with seaborn
# Interactive plots
%matplotlib inline 
import plotly.express as px
import plotly.graph_objects as go
#Data Profiling
from ydata_profiling import ProfileReport

#Data Preprocessing
from sklearn.preprocessing import StandardScaler

# Machine Learning
from sklearn.model_selection import train_test_split # data split
from sklearn import linear_model
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression  # Linear Regression 
from sklearn.tree import DecisionTreeRegressor # Decision Tree Regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor # Random Forest Regression , Gradient Boosting Regression, AdaBoost Regression
from xgboost import XGBRegressor # XGBoost Regression
from catboost import CatBoostRegressor # CatBoost Regression
from lightgbm import LGBMRegressor # LightGBM Regression
from sklearn.metrics import mean_squared_error, r2_score # model evaluation
from scipy import stats

Importing Dataset¶

In [5]:
df=pd.read_csv('diamonds.csv')

Data Overview

In [6]:
df.head()
Out[6]:
Unnamed: 0 carat cut color clarity depth table price x y z
0 1 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 2 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 3 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 4 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 5 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
In [7]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 53940 entries, 0 to 53939
Data columns (total 11 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  53940 non-null  int64  
 1   carat       53940 non-null  float64
 2   cut         53940 non-null  object 
 3   color       53940 non-null  object 
 4   clarity     53940 non-null  object 
 5   depth       53940 non-null  float64
 6   table       53940 non-null  float64
 7   price       53940 non-null  int64  
 8   x           53940 non-null  float64
 9   y           53940 non-null  float64
 10  z           53940 non-null  float64
dtypes: float64(6), int64(2), object(3)
memory usage: 4.5+ MB
In [8]:
df.describe()
Out[8]:
Unnamed: 0 carat depth table price x y z
count 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000 53940.000000
mean 26970.500000 0.797940 61.749405 57.457184 3932.799722 5.731157 5.734526 3.538734
std 15571.281097 0.474011 1.432621 2.234491 3989.439738 1.121761 1.142135 0.705699
min 1.000000 0.200000 43.000000 43.000000 326.000000 0.000000 0.000000 0.000000
25% 13485.750000 0.400000 61.000000 56.000000 950.000000 4.710000 4.720000 2.910000
50% 26970.500000 0.700000 61.800000 57.000000 2401.000000 5.700000 5.710000 3.530000
75% 40455.250000 1.040000 62.500000 59.000000 5324.250000 6.540000 6.540000 4.040000
max 53940.000000 5.010000 79.000000 95.000000 18823.000000 10.740000 58.900000 31.800000

EDA¶

Data Profiling¶

Uncomment to generate and download the data profile. It is commented out here because it takes a long time to run.

Overview of the data embedded in this notebook

In [9]:
#dataset_profile=ProfileReport(df, title="Diamond Data Profile")
#dataset_profile.to_notebook_iframe()

Data Profiling Export to HTML

In [10]:
#dataset_profile.to_file("Diamond Detailed Data Profile.html")

Drop Unnecessary Column: Unnamed: 0¶

In [11]:
df.drop('Unnamed: 0',axis='columns',inplace=True)
df
Out[11]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

Check for Null Values¶

In [12]:
df.isna().sum()
Out[12]:
carat      0
cut        0
color      0
clarity    0
depth      0
table      0
price      0
x          0
y          0
z          0
dtype: int64

There are no null values in the dataset.

Data Analysis¶

Price Distribution

In [13]:
plt.figure(figsize=(20,8))
sns.histplot(x=df['price'],bins=50,kde=True)
plt.tight_layout()
plt.show()

Relation between Price and Carat, Cut, Color, Clarity, Depth, Table, X, Y, Z

In [14]:
fig,ax=plt.subplots(2,3,figsize=(30,16))
i=0;j=0
for col in (df.select_dtypes(include='float64')):
    sns.scatterplot(x=col,y='price',data=df,color='green',ax=ax[i,j])
    j+=1
    if(j==3):
        j=0
        i+=1
plt.tight_layout()
plt.show()
In [15]:
# calculate the correlation matrix on the numeric columns only,
# which also avoids pandas' numeric_only FutureWarning
corr = df.select_dtypes('number').corr()

# plot the heatmap
heatmap = sns.heatmap(corr, vmin=-1, vmax=1, annot=True)
# Give a title to the heatmap. Pad defines the distance of the title from the top of the heatmap.
heatmap.set_title('Correlation Heatmap', fontdict={'fontsize':12}, pad=12);

Relation between Price and Cut , Color , Clarity (Categorical)

In [16]:
fig,ax=plt.subplots(1,3,figsize=(30,8))
palette=['deep','coolwarm',None]
for i,col in enumerate(df.select_dtypes(include='object').columns):
    sns.barplot(x=col,y='price',data=df,ax=ax[i],palette=palette[i])
plt.tight_layout()
plt.show()

Count of Cut, Color, Clarity (Categorical)

In [17]:
fig,ax=plt.subplots(1,3,figsize=(30,8))
palette=['deep','coolwarm',None]
for i,col in enumerate(df.select_dtypes(include='object').columns):
    sns.countplot(x=col,data=df,ax=ax[i],palette=palette[i])
plt.tight_layout()
plt.show()

No real diamond can have a length, width, or depth of zero, so rows where any of x, y, or z is zero are invalid and should be dropped. Rows with width (y) > 30 or depth (z) > 30 also look like outliers, so we remove them too.

In [18]:
#Make a copy of the original dataset
data_new = df.copy()
data_new
Out[18]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53940 rows × 10 columns

Drop the rows with x=0, y=0, z=0 and y>30, z>30

In [19]:
data_new.drop(data_new.loc[(data_new['x']==0)|(data_new['y']==0)|(data_new['z']==0)|(data_new['y']>30)|(data_new['z']>30)].index,inplace=True)
data_new
Out[19]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53917 rows × 10 columns
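The same cleanup can also be written as a boolean-mask filter; a minimal sketch on toy, hypothetical values (not the real rows): keep rows where x, y, z are all nonzero and y, z are at most 30.

```python
import pandas as pd

# Toy frame: row 1 has x == 0, row 2 has y > 30; only row 0 survives.
toy = pd.DataFrame({"x": [3.95, 0.0, 4.0],
                    "y": [3.98, 4.0, 58.9],
                    "z": [2.43, 2.5, 8.1]})
mask = (toy[["x", "y", "z"]] != 0).all(axis=1) & (toy["y"] <= 30) & (toy["z"] <= 30)
clean = toy[mask]
print(len(clean))  # 1
```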

Now we visualize the data after removing the outliers of x, y, z

In [20]:
fig,ax=plt.subplots(2,3,figsize=(30,16))
i=0;j=0
for col in (data_new.select_dtypes(include='float64')):  # plot the cleaned data, not df
    sns.scatterplot(x=col,y='price',data=data_new,color='green',ax=ax[i,j])
    j+=1
    if(j==3):
        j=0
        i+=1
plt.tight_layout()
plt.show()

The scatter plots show that price rises with carat: the two are strongly positively correlated.

Entries with table > 80 appear to be outliers, so we remove them as well.

In [21]:
data_new.drop(data_new.loc[data_new['table']>80].index,inplace=True)
data_new
Out[21]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53916 rows × 10 columns

Preprocessing the Data¶

Check for Duplicate Values¶

In [22]:
data_new[data_new.duplicated()]
data_new.drop_duplicates(inplace=True)
data_new
Out[22]:
carat cut color clarity depth table price x y z
0 0.23 Ideal E SI2 61.5 55.0 326 3.95 3.98 2.43
1 0.21 Premium E SI1 59.8 61.0 326 3.89 3.84 2.31
2 0.23 Good E VS1 56.9 65.0 327 4.05 4.07 2.31
3 0.29 Premium I VS2 62.4 58.0 334 4.20 4.23 2.63
4 0.31 Good J SI2 63.3 58.0 335 4.34 4.35 2.75
... ... ... ... ... ... ... ... ... ... ...
53935 0.72 Ideal D SI1 60.8 57.0 2757 5.75 5.76 3.50
53936 0.72 Good D SI1 63.1 55.0 2757 5.69 5.75 3.61
53937 0.70 Very Good D SI1 62.8 60.0 2757 5.66 5.68 3.56
53938 0.86 Premium H SI2 61.0 58.0 2757 6.15 6.12 3.74
53939 0.75 Ideal D SI2 62.2 55.0 2757 5.83 5.87 3.64

53771 rows × 10 columns

As the row counts show, the dataset contained 145 duplicate rows (53,916 → 53,771), which have been removed.
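A minimal sketch of the mechanics on toy, hypothetical values: duplicated() flags repeats of earlier rows, so its sum counts exactly the rows that drop_duplicates() removes.

```python
import pandas as pd

# Toy frame (hypothetical values): row 1 repeats row 0.
toy = pd.DataFrame({"carat": [0.23, 0.23, 0.31],
                    "price": [326, 326, 335]})
n_dup = int(toy.duplicated().sum())  # repeats of earlier rows
toy = toy.drop_duplicates()
print(n_dup, len(toy))  # 1 2
```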

Encoding Categorical Data¶

Encode the categorical data as numbers so that the machine learning models can use it. In this dataset the cut, color, and clarity columns need encoding. The pd.get_dummies() method creates a dummy column for each category; drop_first=True drops one level per column to avoid redundant (perfectly collinear) dummies. The y and z columns are dropped later because they are highly collinear with x, as the large condition number in the first regression below indicates.
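As a minimal sketch of what pd.get_dummies(..., drop_first=True) produces (toy, hypothetical values, not the real dataset): the alphabetically first category of each encoded column becomes the baseline and gets no dummy of its own.

```python
import pandas as pd

# Toy frame: three cut categories (Good, Ideal, Premium).
toy = pd.DataFrame({"cut": ["Ideal", "Premium", "Good", "Ideal"],
                    "price": [326, 334, 327, 335]})
# drop_first=True drops the first category ("Good"), which becomes the baseline
encoded = pd.get_dummies(toy, columns=["cut"], drop_first=True)
print(encoded.columns.tolist())  # ['price', 'cut_Ideal', 'cut_Premium']
```

A row whose cut is "Good" simply has zeros in both remaining dummy columns.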

In [23]:
#Keep Original Data for further actions
data_Categorical=data_new.copy()
In [24]:
data_new_ready=pd.get_dummies(data_new,columns=['cut','color','clarity'],drop_first=True)

Regression 1 with Dummy variables - with all predictors¶

In [25]:
target='price'
X0=data_new_ready.drop([target],axis=1)
y0=data_new_ready[[target]]
In [26]:
X0_train,X0_test,y0_train,y0_test=train_test_split(X0,y0,test_size=0.3,random_state=0,)
In [27]:
Xi = sm.add_constant(X0_train)
esti = sm.OLS(y0_train, Xi)
esti2 = esti.fit()
print(esti2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.921
Model:                            OLS   Adj. R-squared:                  0.921
Method:                 Least Squares   F-statistic:                 1.900e+04
Date:                Wed, 06 Dec 2023   Prob (F-statistic):               0.00
Time:                        17:42:06   Log-Likelihood:            -3.1768e+05
No. Observations:               37639   AIC:                         6.354e+05
Df Residuals:                   37615   BIC:                         6.356e+05
Df Model:                          23                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const         -4412.8443    755.788     -5.839      0.000   -5894.209   -2931.480
carat          1.157e+04     61.803    187.189      0.000    1.14e+04    1.17e+04
depth            48.8664     10.960      4.458      0.000      27.384      70.349
table           -23.4001      3.474     -6.736      0.000     -30.209     -16.591
x             -1552.9721    122.308    -12.697      0.000   -1792.698   -1313.246
y              1673.1020    125.434     13.338      0.000    1427.248    1918.956
z             -2068.0472    168.343    -12.285      0.000   -2398.004   -1738.090
cut_Good        502.4655     40.593     12.378      0.000     422.902     582.029
cut_Ideal       773.3659     40.234     19.222      0.000     694.506     852.226
cut_Premium     737.9933     38.386     19.226      0.000     662.757     813.230
cut_Very Good   643.2127     39.308     16.363      0.000     566.168     720.257
color_E        -212.7592     21.218    -10.027      0.000    -254.346    -171.172
color_F        -267.4972     21.428    -12.484      0.000    -309.496    -225.498
color_G        -480.0944     21.015    -22.846      0.000    -521.284    -438.905
color_H        -983.5658     22.439    -43.834      0.000   -1027.546    -939.586
color_I       -1479.1774     25.202    -58.692      0.000   -1528.575   -1429.780
color_J       -2384.6537     31.072    -76.747      0.000   -2445.555   -2323.752
clarity_IF     5342.6712     60.890     87.743      0.000    5223.325    5462.018
clarity_SI1    3669.8884     51.820     70.820      0.000    3568.320    3771.457
clarity_SI2    2718.7299     51.998     52.285      0.000    2616.813    2820.647
clarity_VS1    4583.3116     52.903     86.637      0.000    4479.621    4687.002
clarity_VS2    4260.1940     52.090     81.785      0.000    4158.096    4362.292
clarity_VVS1   5002.4335     56.086     89.193      0.000    4892.504    5112.363
clarity_VVS2   4942.7829     54.487     90.715      0.000    4835.988    5049.578
==============================================================================
Omnibus:                    10020.883   Durbin-Watson:                   1.995
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           511304.234
Skew:                           0.472   Prob(JB):                         0.00
Kurtosis:                      21.032   Cond. No.                     1.13e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.13e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
In [28]:
# modify figure size 
fig = plt.figure(figsize=(14, 8)) 
  
# creating regression plots 
fig = sm.graphics.plot_regress_exog(esti2, 
                                    'carat', 
                                    fig=fig) 
eval_env: 1

Regression 2 with Dummy variables - dropping the collinear y, z predictors¶

In [29]:
data_new_ready = data_new_ready.drop(['y', 'z'], axis = 1)
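Collinearity among size-related predictors can also be checked with variance inflation factors; a minimal sketch on synthetic, hypothetical data where b is almost a copy of a, mirroring how x, y, and z track one another.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
a = rng.normal(size=200)
X_toy = pd.DataFrame({"a": a,
                      "b": a + rng.normal(scale=0.01, size=200),  # near-duplicate of a
                      "c": rng.normal(size=200)})                  # independent
vifs = [variance_inflation_factor(X_toy.values, i) for i in range(X_toy.shape[1])]
print(vifs)  # a and b get very large VIFs; c stays near 1
```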

Splitting the Dataset into Training Set and Test Set¶

In [30]:
data_new_ready
Out[30]:
carat depth table price x cut_Good cut_Ideal cut_Premium cut_Very Good color_E ... color_H color_I color_J clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 326 3.95 0 1 0 0 1 ... 0 0 0 0 0 1 0 0 0 0
1 0.21 59.8 61.0 326 3.89 0 0 1 0 1 ... 0 0 0 0 1 0 0 0 0 0
2 0.23 56.9 65.0 327 4.05 1 0 0 0 1 ... 0 0 0 0 0 0 1 0 0 0
3 0.29 62.4 58.0 334 4.20 0 0 1 0 0 ... 0 1 0 0 0 0 0 1 0 0
4 0.31 63.3 58.0 335 4.34 1 0 0 0 0 ... 0 0 1 0 0 1 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
53935 0.72 60.8 57.0 2757 5.75 0 1 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
53936 0.72 63.1 55.0 2757 5.69 1 0 0 0 0 ... 0 0 0 0 1 0 0 0 0 0
53937 0.70 62.8 60.0 2757 5.66 0 0 0 1 0 ... 0 0 0 0 1 0 0 0 0 0
53938 0.86 61.0 58.0 2757 6.15 0 0 1 0 0 ... 1 0 0 0 0 1 0 0 0 0
53939 0.75 62.2 55.0 2757 5.83 0 1 0 0 0 ... 0 0 0 0 0 1 0 0 0 0

53771 rows × 22 columns

Set price as the target variable and use the remaining columns as features.

In [31]:
target='price'
X=data_new_ready.drop([target],axis=1)
y=data_new_ready[[target]]

Checking the contents of X and y

In [32]:
X.head(1)
Out[32]:
carat depth table x cut_Good cut_Ideal cut_Premium cut_Very Good color_E color_F ... color_H color_I color_J clarity_IF clarity_SI1 clarity_SI2 clarity_VS1 clarity_VS2 clarity_VVS1 clarity_VVS2
0 0.23 61.5 55.0 3.95 0 1 0 0 1 0 ... 0 0 0 0 0 1 0 0 0 0

1 rows × 21 columns

In [33]:
y.head(1)
Out[33]:
price
0 326

Train Test Split¶

In [34]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=0,)

Feature Scaling¶

Feature scaling standardizes the features to zero mean and unit variance so that scale-sensitive models train reliably; it is left commented out here since the models used below do not require it.

In [35]:
#SC_X = StandardScaler()
#X_train = SC_X.fit_transform(X_train)
#X_test = SC_X.transform(X_test)
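If scaling is enabled, the pattern above matters: fit the scaler on the training split only, then reuse the same statistics on the test split to avoid leaking test information. A minimal sketch on toy numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_tr = np.array([[1.0], [2.0], [3.0]])
X_te = np.array([[2.0], [4.0]])
sc = StandardScaler()
X_tr_s = sc.fit_transform(X_tr)  # learns mean/variance from the train set only
X_te_s = sc.transform(X_te)      # reuses the train statistics
print(X_tr_s.mean())  # 0.0
```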

Machine Learning Models¶

First, define a helper function that trains a model and records its scores.

In [36]:
training_score = []
testing_score = []
rmse=[]
In [37]:
def model_prediction(model):
    model.fit(X_train, y_train.values.ravel())  # ravel avoids the column-vector warning
    y_train_pred = model.predict(X_train)
    y_test_pred = model.predict(X_test)
    a = r2_score(y_train,y_train_pred)*100
    b = r2_score(y_test,y_test_pred)*100
    c = mean_squared_error(y_test, y_test_pred)  # note: this is the MSE, not the RMSE
    training_score.append(a)
    testing_score.append(b)
    rmse.append(c)


    print(f"r2_Score of {model} model on Training Data is:",a)
    print(f"r2_Score of {model} model on Testing Data is:",b)
    print(f"MSE of {model} model on Testing Data is:",c) 
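Note that sklearn's mean_squared_error returns the MSE in squared price units; a minimal sketch with toy numbers of converting it to an RMSE, which is back in the same units as price:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 310.0])
mse = mean_squared_error(y_true, y_pred)  # average of the squared errors
rmse = float(np.sqrt(mse))                # square root brings it back to price units
print(mse, rmse)  # 100.0 10.0
```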

Linear Regression¶

In [38]:
model_prediction(LinearRegression())
r2_Score of LinearRegression() model on Training Data is: 92.02780118315786
r2_Score of LinearRegression() model on Testing Data is: 91.99397608062448
MSE of LinearRegression() model on Testing Data is: 1281621.0212671217
In [39]:
model_prediction(DecisionTreeRegressor())
r2_Score of DecisionTreeRegressor() model on Training Data is: 99.99908658591109
r2_Score of DecisionTreeRegressor() model on Testing Data is: 95.21759453879659
MSE of DecisionTreeRegressor() model on Testing Data is: 765577.4493088272
In [41]:
X2 = sm.add_constant(X_train)
est = sm.OLS(y_train, X2)
est2 = est.fit()
print(est2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  price   R-squared:                       0.920
Model:                            OLS   Adj. R-squared:                  0.920
Method:                 Least Squares   F-statistic:                 2.068e+04
Date:                Wed, 06 Dec 2023   Prob (F-statistic):               0.00
Time:                        17:42:44   Log-Likelihood:            -3.1779e+05
No. Observations:               37639   AIC:                         6.356e+05
Df Residuals:                   37617   BIC:                         6.358e+05
Df Model:                          21                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
const          3273.8578    467.079      7.009      0.000    2358.370    4189.345
carat          1.154e+04     61.707    187.042      0.000    1.14e+04    1.17e+04
depth           -76.4341      4.863    -15.716      0.000     -85.967     -66.902
table           -24.9481      3.478     -7.173      0.000     -31.765     -18.131
x             -1148.7170     26.132    -43.958      0.000   -1199.937   -1097.497
cut_Good        598.4588     39.983     14.968      0.000     520.091     676.827
cut_Ideal       851.7280     39.872     21.362      0.000     773.578     929.878
cut_Premium     777.4750     38.411     20.241      0.000     702.189     852.761
cut_Very Good   746.3731     38.423     19.425      0.000     671.064     821.682
color_E        -211.9465     21.281     -9.959      0.000    -253.658    -170.235
color_F        -267.4180     21.492    -12.443      0.000    -309.543    -225.293
color_G        -480.6893     21.077    -22.806      0.000    -522.002    -439.377
color_H        -986.7398     22.503    -43.849      0.000   -1030.847    -942.633
color_I       -1474.0207     25.275    -58.319      0.000   -1523.561   -1424.481
color_J       -2381.1576     31.164    -76.407      0.000   -2442.240   -2320.075
clarity_IF     5395.2985     60.904     88.587      0.000    5275.925    5514.672
clarity_SI1    3714.5770     51.850     71.641      0.000    3612.950    3816.204
clarity_SI2    2756.4792     52.065     52.943      0.000    2654.430    2858.528
clarity_VS1    4629.4998     52.918     87.485      0.000    4525.780    4733.220
clarity_VS2    4302.1754     52.128     82.531      0.000    4200.003    4404.348
clarity_VVS1   5049.8087     56.109     89.999      0.000    4939.833    5159.784
clarity_VVS2   4991.0682     54.500     91.580      0.000    4884.248    5097.889
==============================================================================
Omnibus:                    10002.758   Durbin-Watson:                   1.996
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           528015.336
Skew:                           0.455   Prob(JB):                         0.00
Kurtosis:                      21.326   Cond. No.                     6.85e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 6.85e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
In [42]:
# modify figure size 
fig = plt.figure(figsize=(14, 8)) 
  
# creating regression plots 
fig = sm.graphics.plot_regress_exog(est2, 
                                    'carat', 
                                    fig=fig) 
eval_env: 1
In [43]:
# est2 is the fitted OLS model from the regression above
residuals = est2.resid

# Create a residual normal distribution plot
plt.figure(figsize=(8, 6))
plt.hist(residuals, bins=30, density=True, color='blue', alpha=0.7)
mu, sigma = stats.norm.fit(residuals)
xmin, xmax = plt.xlim()
x = np.linspace(xmin, xmax, 100)
p = stats.norm.pdf(x, mu, sigma)
plt.plot(x, p, 'k', linewidth=2)
title = "Fit results: mu = %.2f,  std = %.2f" % (mu, sigma)
plt.title(title)
plt.xlabel('Residuals')
plt.ylabel('Frequency')
plt.show()
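Beyond the histogram, skewness and excess kurtosis give a quick numeric normality check; a minimal sketch on synthetic residuals (hypothetical scale, not the model's actual residuals): values near 0 are consistent with normality, whereas the high kurtosis in the OLS summary above signals heavy tails.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
resid = rng.normal(loc=0.0, scale=1000.0, size=5000)  # synthetic, roughly normal
# For a normal sample both statistics should be close to 0.
print(round(float(stats.skew(resid)), 3), round(float(stats.kurtosis(resid)), 3))
```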

Random Forest Regression¶

In [44]:
model_prediction(RandomForestRegressor())
r2_Score of RandomForestRegressor() model on Training Data is: 99.63949723600446
r2_Score of RandomForestRegressor() model on Testing Data is: 97.53962108446606
MSE of RandomForestRegressor() model on Testing Data is: 393862.59274087363

XGBRegressor¶

In [45]:
model_prediction(XGBRegressor())
r2_Score of XGBRegressor(...) model on Training Data is: 98.71031434640751
r2_Score of XGBRegressor(...) model on Testing Data is: 97.8180153412913
MSE of XGBRegressor(...) model on Testing Data is: 349296.65897145646

Gradient-Boosting-Regressor Model¶

In [46]:
model_prediction(GradientBoostingRegressor())
r2_Score of GradientBoostingRegressor() model on Training Data is: 95.42919056247987
r2_Score of GradientBoostingRegressor() model on Testing Data is: 95.35198883069714
MSE of GradientBoostingRegressor() model on Testing Data is: 744063.3305186994

Ada-Boost-Regressor Model¶

In [47]:
model_prediction(AdaBoostRegressor())
r2_Score of AdaBoostRegressor() model on Training Data is: 84.70057605755476
r2_Score of AdaBoostRegressor() model on Testing Data is: 84.67999292256461
MSE of AdaBoostRegressor() model on Testing Data is: 2452458.712855532

LGBM Regressor Model¶

In [48]:
model_prediction(LGBMRegressor())
[LightGBM] [Warning] Found whitespace in feature_names, replace with underlines
[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002506 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 771
[LightGBM] [Info] Number of data points in the train set: 37639, number of used features: 21
[LightGBM] [Info] Start training from score 3931.778103
r2_Score of LGBMRegressor() model on Training Data is: 98.38651059955792
r2_Score of LGBMRegressor() model on Testing Data is: 97.96913972681276
MSE of LGBMRegressor() model on Testing Data is: 325104.3518710946

Cat-Boost-Regressor Model¶

In [49]:
model_prediction(CatBoostRegressor(verbose=False))
r2_Score of CatBoostRegressor model on Training Data is: 98.52147606503685
r2_Score of CatBoostRegressor model on Testing Data is: 97.95277938307015
MSE of CatBoostRegressor model on Testing Data is: 327723.34984897176

All Models Comparison¶

Create a DataFrame to compare all the models.

In [50]:
models = ["Linear Regression","Decision Tree Regression","Random Forest Regression","XGBoost","Gradient Boosting Regression","AdaBoost Regression","LGBM Regression","CatBoost Regression"]
In [51]:
compare_models = pd.DataFrame({"Algorithms":models,
                   "Training Score":training_score,
                   "Testing Score":testing_score,"MSE":rmse})
compare_models
Out[51]:
Algorithms Training Score Testing Score MSE
0 Linear Regression 92.027801 91.993976 1.281621e+06
1 Decision Tree Regression 99.999087 95.217595 7.655774e+05
2 Random Forest Regression 99.639497 97.539621 3.938626e+05
3 XGBoost 98.710314 97.818015 3.492967e+05
4 Gradient Boosting Regression 95.429191 95.351989 7.440633e+05
5 AdaBoost Regression 84.700576 84.679993 2.452459e+06
6 LGBM Regression 98.386511 97.969140 3.251044e+05
7 CatBoost Regression 98.521476 97.952779 3.277233e+05

Plot the R2 score of each model as a bar chart.

In [52]:
compare_models.plot(x="Algorithms",y=["Training Score","Testing Score"], figsize=(16,6),kind="bar",title="Performance Visualization of Different Models by R2 Score",colormap="rainbow")
plt.show()

Plot the MSE of each model as a bar chart.

In [53]:
compare_models.plot(x="Algorithms",y=["MSE"], figsize=(16,6),kind="bar",title="Performance Visualization of Different Models by MSE",colormap="Dark2")
plt.show()
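The comparison frame can also be queried programmatically rather than read off the chart; a minimal sketch with a hypothetical subset of the scores above:

```python
import pandas as pd

cm = pd.DataFrame({"Algorithms": ["Linear Regression", "XGBoost", "LGBM Regression"],
                   "Testing Score": [91.99, 97.82, 97.97]})
# idxmax returns the row label of the highest testing score
best = cm.loc[cm["Testing Score"].idxmax(), "Algorithms"]
print(best)  # LGBM Regression
```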

Conclusion¶

The linear regression model fits the data well, with an R-squared of 0.92: approximately 92% of the variance in price is explained by the predictors. All retained predictors, including carat, the cut, color, and clarity dummies, and the geometric feature x, are statistically significant (p < 0.001), and the highly significant F-statistic confirms that the overall regression is meaningful. In the final model (Regression 2), the coefficient for carat is about 11,540, indicating a strong positive relationship between carat and price, while each cut, color, and clarity category shifts the predicted price by its own distinct coefficient. The residuals are roughly symmetric around zero, although the high kurtosis in the OLS summary points to heavy tails, and the large condition number signals multicollinearity among the size-related features, which motivated dropping y and z. Among all the models compared, the boosted ensembles (XGBoost, LightGBM, CatBoost) achieve the best test performance, with R-squared scores close to 98%.